ABSTRACT

Evaluation metrics are an essential part of a ranking 
system, and in the past many evaluation metrics have 
been proposed in information retrieval and Web search. 
Discounted Cumulated Gains (DCG) has emerged as 
one of the evaluation metrics widely adopted for evalu- 
ating the performance of ranking functions used in Web 
search. However, the two sets of parameters, gain values 
and discount factors, used in DCG are determined in a 
rather ad-hoc way. In this paper we first show that DCG 
is generally not coherent, meaning that comparing the 
performance of ranking functions using DCG very much 
depends on the particular gain values and discount fac- 
tors used. We then propose a novel methodology that 
can learn the gain values and discount factors from user 
preferences over rankings. Numerical simulations illus- 
trate the effectiveness of our proposed methods. Please 
contact the authors for the full version of this work. 

1. INTRODUCTION 

Discounted Cumulated Gains (DCG) is a popular eval- 
uation metric for comparing the performance of ranking 
functions [3]. It can deal with multi-grade judgments 
and it also explicitly incorporates the position informa- 
tion of the documents in the result sets through the use 
of discount factors. However, the selection of the two sets of parameters used in DCG, the gain values and the discount factors, has so far been rather arbitrary, and several different sets of values have been used. This is a rather unsatisfactory situation considering the popularity
of DCG. In this paper, we address the following two
important issues of DCG: 

1. Does the parameter set matter? That is, do different parameter sets give rise to different preferences over the ranking functions?

2. If the answer to the above question is yes, is there a principled way to select the set of parameters?

The answer to the first question is yes if there are 
more than two grades used in the evaluation. This is 
generally the case for Web search where multiple grades 
are used to indicate the degree of relevance of documents 
with respect to a query. We then propose a principled approach for learning the set of parameters using preferences over different rankings of the documents. As will be shown, the resulting optimization problem for learning the parameters can be solved using quadratic programming, much as is done in support vector machines for classification. We ran several numerical simulations that illustrate the feasibility and effectiveness of the proposed methodology. We want to emphasize that the experimental results are preliminary and limited in scope because they rely on simulated data; experiments using real-world search engine data are being considered.

2. RELATED WORK 

Cumulated gain based measures such as DCG [3] have been applied to evaluate information retrieval systems. Despite their popularity, to the best of our knowledge little research has focused on analyzing the coherence of these measures. The study of [7] shows that different gain values of DCG can give rise to different judgements of ranking lists. In this study, we first show that DCG is incoherent in general and then propose a principled method to learn the DCG parameters as a linear utility function.

Learning to rank has attracted considerable research interest in recent years. Several methods have been developed to learn the ranking function by directly optimizing performance metrics such as MAP and DCG [5][8][9]. These studies focus on learning a good ranking function with respect to given performance metrics, while the goal of this paper is to analyze the coherence of DCG and propose a learning method to determine the parameters of DCG.

As discussed in Section 5, DCG can be viewed as a linear utility function. Therefore, the problem of learning DCG is closely related to the problem of learning a utility function. Learning utility functions is studied under the name of conjoint analysis by the marketing science community [1][6]. The goal of conjoint analysis is to model users' preferences over products and infer the features that satisfy the demands of users. Several methods have been proposed to solve this problem [2][3].

3. DISCOUNTED CUMULATED GAINS 

We first introduce some notation used in this paper. We are interested in ranking $N$ documents $X = \{x_1, \ldots, x_N\}$. We assume that we have a finite ordinal label (grade) set $\mathcal{L} = \{\ell_1, \ldots, \ell_L\}$. We assume that $\ell_i$ is preferred over $\ell_{i+1}$, $i = 1, \ldots, L-1$. In Web search, for example, we can have

$$\mathcal{L} = \{\text{Perfect}, \text{Excellent}, \text{Good}, \text{Fair}, \text{Bad}\},$$

i.e., $L = 5$. A ranking of $X$ is a permutation

$$\pi = (\pi(1), \ldots, \pi(N))$$

of $(1, \ldots, N)$, i.e., the rank of $x_{\pi(i)}$ under the ranking $\pi$ is $i$.

With each label $\ell_i$ is associated a gain value $g_i = g(\ell_i)$, and $g_i$, $i = 1, \ldots, L$ constitute the set of gain values associated with $\mathcal{L}$. The DCG for $\pi$ with the associated labels is computed as

$$\mathrm{DCG}_{g,K}(\pi) = \sum_{i=1}^{K} c_i g_{\pi(i)}, \quad K = 1, \ldots, N,$$

where $g_{\pi(i)}$ denotes the gain of the label of document $x_{\pi(i)}$ and $c_1 > c_2 > \cdots > c_K > 0$ are the so-called discount factors [4].
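As a concrete illustration of the definition above, the following minimal Python sketch computes DCG from the gains of the documents in ranked order. The function name `dcg` and the numeric gains and discounts in the usage line are illustrative assumptions, not values taken from this paper.

```python
def dcg(ranked_gains, discounts):
    """Return sum_{i=1..K} c_i * g_{pi(i)}, with K = len(discounts).

    ranked_gains[i] is the gain of the document placed at position i + 1;
    discounts is the decreasing sequence c_1 > c_2 > ... > c_K > 0.
    """
    if len(ranked_gains) < len(discounts):
        raise ValueError("need at least K ranked documents")
    # zip stops after K terms, so documents below position K are ignored.
    return sum(c * g for c, g in zip(discounts, ranked_gains))


# Hypothetical example: three ranked documents, K = 2.
example = dcg([3.0, 1.0, 0.0], [1.0, 0.5])
```

Swapping the first two documents lowers the score, because the larger gain then meets the smaller discount factor.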

The gain vector $g = [g_1, \ldots, g_L]$ is said to be compatible if $g_1 > g_2 > \cdots > g_L$. If two gain vectors $g^A$ and $g^B$ are both compatible, then we say they are compatible with each other. In this case, there is a transformation $\phi$ such that

$$\phi(g_i^A) = g_i^B, \quad i = 1, \ldots, L,$$

and the transformation is order preserving, i.e.,

$$\phi(g_i) > \phi(g_j), \quad \text{if } g_i > g_j.$$

4. INCOHERENCY OF DCG 

Now assume there are two rankers A and B using DCG with gain vectors $g^A$ and $g^B$, respectively. We want to investigate how coherent A and B are in evaluating different rankings.



4.1 Good News 

First the good news: if A and B are compatible, then A and B agree on which set of rankings is optimal, i.e., which set of rankings has the highest DCG. We first state the following well-known result.

Proposition 1. Let $a_1 \ge \cdots \ge a_N$ and $b_1 \ge \cdots \ge b_N$. Then

$$\sum_{i=1}^{N} a_i b_i = \max_{\pi} \sum_{i=1}^{N} a_i b_{\pi(i)}.$$

It follows from the above result that any ranking $\pi$ such that

$$g_{\pi(1)} \ge g_{\pi(2)} \ge \cdots \ge g_{\pi(K)}$$

achieves the highest $\mathrm{DCG}_{g,K}$, as long as the gain vector $g$ is compatible.

How about those rankings that have smaller DCGs? We say two compatible rankers A and B are coherent if they score any two rankings coherently, i.e., for rankings $\pi_1$ and $\pi_2$,

$$\mathrm{DCG}_{g^A,K}(\pi_1) > \mathrm{DCG}_{g^A,K}(\pi_2)$$

if and only if

$$\mathrm{DCG}_{g^B,K}(\pi_1) > \mathrm{DCG}_{g^B,K}(\pi_2),$$

i.e., ranker A thinks $\pi_1$ is better than $\pi_2$ if and only if ranker B thinks $\pi_1$ is better than $\pi_2$. Now the question is whether compatibility implies coherency. We have the following result.

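Theorem. If $L = 2$, then compatibility implies coherency.

Proof. Fix $K > 1$, and let

$$c = \sum_{i=1}^{K} c_i.$$

When there are only two labels, let the corresponding gains be $g_1, g_2$. For a ranking $\pi$, define

$$c_1(\pi) = \sum_{\pi(i) = \ell_1} c_i, \quad c_2(\pi) = c - c_1(\pi),$$

where the sum runs over the positions $i \le K$ whose document has label $\ell_1$. Then

$$\mathrm{DCG}_{g,K}(\pi) = c_1(\pi) g_1 + c_2(\pi) g_2.$$

For any two rankings $\pi_1$ and $\pi_2$,

$$\mathrm{DCG}_{g^A,K}(\pi_1) > \mathrm{DCG}_{g^A,K}(\pi_2)$$

implies that

$$c_1(\pi_1) g_1^A + c_2(\pi_1) g_2^A > c_1(\pi_2) g_1^A + c_2(\pi_2) g_2^A,$$

which, after substituting $c_2(\pi) = c - c_1(\pi)$, gives

$$(c_1(\pi_1) - c_1(\pi_2))(g_1^A - g_2^A) > 0.$$

Since A and B are compatible, $g_1^A - g_2^A$ and $g_1^B - g_2^B$ are both positive, so the above implies that

$$(c_1(\pi_1) - c_1(\pi_2))(g_1^B - g_2^B) > 0.$$

Therefore $\mathrm{DCG}_{g^B,K}(\pi_1) > \mathrm{DCG}_{g^B,K}(\pi_2)$. The proof is completed by exchanging A and B in the above arguments.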

4.2 Bad News 

Not too surprisingly, compatibility does not imply coherency when $L > 2$. We now present an example.

Example. Let $X = \{x_1, x_2, x_3\}$, i.e., $N = 3$. We consider $\mathrm{DCG}_{g,K}$ with $K = 2$. Assume the labels of $x_1, x_2, x_3$ are $\ell_2, \ell_1, \ell_3$, and for ranker A, the corresponding gains are $2, 3, 1/2$. The optimal ranking is $(2, 1, 3)$. Consider the following two rankings,

$$\pi_1 = (1, 3, 2), \quad \pi_2 = (3, 2, 1).$$

Neither of them is optimal. Let the discount factors be

$$c_1 = 1 + \epsilon, \quad c_2 = 1 - \epsilon, \quad 1/4 < \epsilon < 1.$$

It is easy to check that

$$\mathrm{DCG}_{g^A,2}(\pi_1) = 2c_1 + (1/2)c_2 > (1/2)c_1 + 3c_2 = \mathrm{DCG}_{g^A,2}(\pi_2).$$

Now let $g^B = \phi(g^A)$, where $\phi(t) = t^k$, and $\phi$ is certainly order preserving, i.e., A and B are compatible. However, it is easy to see that for $k$ large enough, we have

$$2^k c_1 + (1/2)^k c_2 < (1/2)^k c_1 + 3^k c_2,$$

which is the same as

$$\mathrm{DCG}_{g^B,2}(\pi_1) < \mathrm{DCG}_{g^B,2}(\pi_2).$$

Therefore, A thinks $\pi_1$ is better than $\pi_2$ while B thinks $\pi_2$ is better than $\pi_1$ even though A and B are compatible. This implies A and B are not coherent.
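The example can be checked numerically. The sketch below uses the gains, labels and rankings given in the example; the specific choices of epsilon = 0.5 and k = 3, and all function and variable names, are assumptions made only for illustration.

```python
def dcg(ranking, gains, discounts):
    """DCG over the top K = len(discounts) positions of a ranking.

    ranking lists document ids in ranked order; gains maps id -> gain.
    """
    return sum(c * gains[doc] for c, doc in zip(discounts, ranking))


EPS = 0.5                                # assumed value inside (1/4, 1)
DISCOUNTS = [1 + EPS, 1 - EPS]           # c_1, c_2
GAINS_A = {1: 2.0, 2: 3.0, 3: 0.5}       # gains of x_1, x_2, x_3 for ranker A
PI_1, PI_2 = (1, 3, 2), (3, 2, 1)


def prefers_pi1(k):
    """True if the ranker with gains (g^A)^k scores pi_1 above pi_2."""
    gains_b = {doc: g ** k for doc, g in GAINS_A.items()}
    return dcg(PI_1, gains_b, DISCOUNTS) > dcg(PI_2, gains_b, DISCOUNTS)


a_prefers_pi1 = prefers_pi1(1)   # ranker A itself (k = 1)
b_prefers_pi1 = prefers_pi1(3)   # ranker B with k = 3
```

With these assumed values, ranker A scores the two rankings 3.25 and 2.25, while ranker B with k = 3 scores them 12.0625 and 13.6875, so the preference is reversed.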

4.3 Remarks 

When we have more than two labels, which is the case for Web search, using $\mathrm{DCG}_K$ with $K > 1$ to compare the DCGs of various ranking functions will very much depend on the gain vectors used. Different gain vectors can lead to completely different conclusions about the performance of the ranking functions.

The current choice of gain vectors for Web search is rather ad hoc, and there is no criterion to judge which set of gain vectors is reasonable or natural.

5. LEARNING GAIN VALUES AND DISCOUNT FACTORS

DCG can be considered as a simple form of linear utility function. In this section, we discuss a method to learn the gain values and discount factors that constitute this utility function.

5.1 A Binary Representation 

We consider a fixed $K$, and we use a binary vector $s(\pi)$ of dimension $K \times L$ to represent a ranking $\pi$ considered for $\mathrm{DCG}_{g,K}$. Here $L$ is the number of levels of the labels. In particular, the first $L$ components of $s$ correspond to the first position of the $K$-position ranking in question, the second $L$ components to the second position, and so on. Within each block of $L$ components, the $i$-th component is 1 if and only if the item in the corresponding position has label $\ell_i$, $i = 1, \ldots, L$.

Example. In the Web search case, $L = 5$. Suppose we consider $\mathrm{DCG}_{g,3}$, and for a particular ranking $\pi$ the labels of the first three documents are

Perfect, Bad, Good.

Then the corresponding 15-dimensional binary vector $s(\pi)$ is

[1,0,0,0,0, 0,0,0,0,1, 0,0,1,0,0]. 
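The encoding in this example can be sketched in a few lines of Python. The helper name `encode_labels` is hypothetical; the label order follows the Web search label set given earlier.

```python
LABELS = ["Perfect", "Excellent", "Good", "Fair", "Bad"]


def encode_labels(top_labels, labels=LABELS):
    """Concatenate one one-hot block of length L per ranked position."""
    s = []
    for label in top_labels:
        block = [0] * len(labels)
        block[labels.index(label)] = 1   # mark the label seen at this position
        s.extend(block)
    return s


s = encode_labels(["Perfect", "Bad", "Good"])
```

Each block contains exactly one 1, so the vector always has K nonzero entries.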

We postulate a utility function $u(s) = w^T s$, which is a linear function of $s$, where $w$ is the weight vector, and we write

$$w = [w_{1,1}, \ldots, w_{1,L}, \ w_{2,1}, \ldots, w_{2,L}, \ \ldots, \ w_{K,1}, \ldots, w_{K,L}].$$

We distinguish two cases. 

Case 1. The gain values are position independent. This corresponds to the case

$$w_{i,j} = c_i g_j, \quad i = 1, \ldots, K, \ j = 1, \ldots, L.$$

That is, $c_i$, $i = 1, \ldots, K$ are the discount factors, and $g_j$, $j = 1, \ldots, L$ are the gain values. It is easy to see that

$$w^T s(\pi) = \mathrm{DCG}_{g,K}(\pi).$$
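The identity between the linear utility and DCG in Case 1 can be verified on a small instance. The discount and gain values below are hypothetical, chosen only so the arithmetic is exact, and the helper names are not from the paper.

```python
def case1_weights(discounts, gains):
    """Flatten w_{i,j} = c_i * g_j in the order w_{1,1..L}, ..., w_{K,1..L}."""
    return [c * g for c in discounts for g in gains]


def one_hot(label_indices, num_labels):
    """Binary vector s(pi); label_indices[i] is the 0-based label at position i + 1."""
    s = [0] * (len(label_indices) * num_labels)
    for pos, j in enumerate(label_indices):
        s[pos * num_labels + j] = 1
    return s


discounts = [1.0, 0.5, 0.25]          # hypothetical c_1, c_2, c_3
gains = [4.0, 3.0, 2.0, 1.0, 0.0]     # hypothetical g_1 > ... > g_5
ranking = [0, 4, 2]                   # labels: Perfect, Bad, Good

w = case1_weights(discounts, gains)
s = one_hot(ranking, len(gains))
utility = sum(wi * si for wi, si in zip(w, s))                # w^T s(pi)
dcg = sum(c * gains[j] for c, j in zip(discounts, ranking))   # direct DCG
```

Both quantities evaluate to 4.5 here, since the one-hot blocks simply select one product c_i g_j per position.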

Case 2. In this framework, we can consider the more general case in which the gain values are position dependent. Then $w_{1,1}, \ldots, w_{1,L}$ are just the products of the discount factor $c_1$ and the position-dependent gain values for position one, and so on. In this case, there is no need to separate the gain values and the discount factors. The weights in the weight vector $w$ are what we need.

5.2 Learning w 

We assume we have available a partial set of preferences over the set of all rankings. For example, we can present a pair of rankings $\pi_1$ and $\pi_2$ to a user, and the user prefers $\pi_1$ over $\pi_2$, denoted by $\pi_1 \succ \pi_2$, which translates into $w^T s(\pi_1) > w^T s(\pi_2)$. Let the set of preferences be

$$\pi_i \succ \pi_j, \quad (i, j) \in S.$$

In the second case described above, we can formulate the problem as learning the weight vector $w$ subject to a set of constraints (similar to rank SVM):

$$\min_{w, \xi} \ w^T w + C \sum_{(i,j) \in S} \xi_{ij} \quad (1)$$

subject to

$$w^T (s(\pi_i) - s(\pi_j)) \ge 1 - \xi_{ij}, \quad \xi_{ij} \ge 0, \quad (i, j) \in S,$$

$$w_{k,l} \ge w_{k,l+1}, \quad k = 1, \ldots, K, \ l = 1, \ldots, L-1.$$

For the first case, we can compute $w$ as in Case 2, and then find $c_i$ and $g_j$ to fit $w$. It is also possible to carry out hypothesis testing to see whether the gain values are position dependent or not.
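To make the structure of this optimization problem concrete, the sketch below evaluates the objective for a given weight vector and checks the monotonicity constraints. It does not solve the quadratic program. It assumes the objective has the usual soft-margin form, a squared norm plus a trade-off constant times the sum of slacks; the names `objective` and `is_monotone` and the toy data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def objective(w, pairs, C=1.0):
    """w^T w + C * sum of slacks, with each slack at its smallest feasible value.

    pairs is a list of (s_preferred, s_other) binary vectors. The smallest
    slack satisfying w^T (s_i - s_j) >= 1 - xi and xi >= 0 is
    max(0, 1 - w^T (s_i - s_j)).
    """
    slack = 0.0
    for s_pref, s_other in pairs:
        diff = [a - b for a, b in zip(s_pref, s_other)]
        slack += max(0.0, 1.0 - dot(w, diff))
    return dot(w, w) + C * slack


def is_monotone(w, K, L):
    """Check w_{k,l} >= w_{k,l+1} inside every position block."""
    return all(w[k * L + l] >= w[k * L + l + 1]
               for k in range(K) for l in range(L - 1))


pairs = [([1, 0], [0, 1])]               # K = 1, L = 2: label 1 preferred to label 2
loose = objective([0.5, 0.0], pairs)     # margin 0.5, so slack 0.5
tight = objective([2.0, 0.0], pairs)     # margin 2, so slack 0
```

In practice the minimization itself would be handed to a quadratic programming solver; this sketch only shows what such a solver would be asked to minimize.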

6. SIMULATION 

In this section, we report the results of numerical simulations to show the feasibility and effectiveness of the method proposed in Equation (1).

6.1 Experimental Settings 

We use a ground-truth $w$ to obtain preferences over ranking lists. Our goal is to investigate whether we can reconstruct $w$ via learning from these preferences. The ground-truth $w$ is generated according to the following equation:

$$w_{k,l} = \frac{G_l}{\log(k+1)}. \quad (2)$$

For a comprehensive comparison, we distinguish two settings of $G_l$. Specifically, we set $G_l = l$ in the first setting (Data 1), and $G_l = 2^l - 1$ in the second setting (Data 2).

The ranking lists are obtained by randomly permuting a ground-truth ranking list. For example, the ranking lists can be generated by permuting the list [5, 5, 4, 4, 3, 3, 2, 2, 1, 1] randomly. We randomly generate different numbers of pairs of ranking lists and use the ground-truth $w$ to judge which ranking list is preferred. Specifically, if $w^T s(\pi_1) > w^T s(\pi_2)$, we have a preference pair $\pi_1 \succ \pi_2$. Otherwise we have a preference pair $\pi_2 \succ \pi_1$.
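The data generation described in this subsection might be sketched as follows. Several details are assumptions rather than facts from the paper: the ground-truth weights are taken to be a grade-dependent gain divided by log(k + 1) with a natural logarithm, K is set to the full list length of 10, and the seed and helper names are arbitrary.

```python
import math
import random


def true_weights(K, L, gain):
    """w[k-1][l-1] = gain(l) / log(k + 1) for positions k and grades l."""
    return [[gain(l) / math.log(k + 1) for l in range(1, L + 1)]
            for k in range(1, K + 1)]


def utility(w, grades):
    """w^T s(pi) for a list of grades in ranked order (top K = len(w) used)."""
    return sum(row[g - 1] for row, g in zip(w, grades))


def sample_pair(base, w, rng):
    """Permute `base` twice and order the two lists by ground-truth utility."""
    a, b = base[:], base[:]
    rng.shuffle(a)
    rng.shuffle(b)
    # As in the text, anything that is not strictly better counts as worse.
    return (a, b) if utility(w, a) > utility(w, b) else (b, a)


base = [5, 5, 4, 4, 3, 3, 2, 2, 1, 1]
w_data1 = true_weights(K=10, L=5, gain=lambda l: l)            # Data 1
w_data2 = true_weights(K=10, L=5, gain=lambda l: 2 ** l - 1)   # Data 2
rng = random.Random(0)
train = [sample_pair(base, w_data1, rng) for _ in range(20)]
```

Each element of `train` is a (preferred, other) pair of grade lists, which can then be encoded as binary vectors to form the constraints of the learning problem.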

6.2 Evaluation Measures 

Given the estimated $\hat{w}$ and the ground-truth $w$, we apply two measures to evaluate the quality of the estimated $\hat{w}$.

The first measure is the precision on a test set. A number of pairs of ranking lists are generated as the test set. We apply $\hat{w}$ to predict the preferences over the test set. Then the precision of $\hat{w}$ is calculated as the proportion of correctly predicted preferences in the test set.

The second measure is the similarity of $w$ and $\hat{w}$. Given the true value of $w$ and the estimate $\hat{w}$ defined by the above optimization problem, the similarity between $w$ and $\hat{w}$ can be defined as follows. Let

$$T(w) = [w_{1,1} - w_{1,L}, \ldots, w_{1,L} - w_{1,L}, \ \ldots, \ w_{K,1} - w_{K,L}, \ldots, w_{K,L} - w_{K,L}]. \quad (3)$$

We can observe that the transformation $T$ preserves the order between ranking lists, i.e., $T(w)^T s(\pi_1) > T(w)^T s(\pi_2)$ iff $w^T s(\pi_1) > w^T s(\pi_2)$. The similarity between $w$ and $\hat{w}$ is measured by

$$\mathrm{sim}(w, \hat{w}) = \frac{T(w)^T T(\hat{w})}{\|T(w)\| \, \|T(\hat{w})\|}.$$
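The two evaluation measures can be sketched as below. The precision follows the description in the text directly. The similarity is assumed here to be a cosine similarity between the offset-shifted weight vectors, which is an interpretation and not confirmed by the text. All names and the toy vectors are hypothetical.

```python
import math


def shift(w, K, L):
    """T(w): subtract the last weight w_{k,L} from every weight of block k."""
    return [w[k * L + l] - w[k * L + L - 1] for k in range(K) for l in range(L)]


def similarity(w_true, w_hat, K, L):
    """Cosine between T(w_true) and T(w_hat) (assumed form of the measure)."""
    a, b = shift(w_true, K, L), shift(w_hat, K, L)
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def precision(w_true, w_hat, test_pairs):
    """Fraction of test pairs on which w_hat orders the pair as w_true does."""
    def score(w, s):
        return sum(wi * si for wi, si in zip(w, s))
    hits = sum((score(w_true, s1) > score(w_true, s2)) ==
               (score(w_hat, s1) > score(w_hat, s2))
               for s1, s2 in test_pairs)
    return hits / len(test_pairs)


w_true = [3.0, 2.0, 1.0]                  # K = 1, L = 3
w_hat = [16.0, 14.0, 12.0]                # 2 * w_true + 10: same ordering
pairs = [([1, 0, 0], [0, 1, 0]), ([0, 0, 1], [0, 1, 0])]
sim = similarity(w_true, w_hat, K=1, L=3)
prec = precision(w_true, w_hat, pairs)
```

Because the shift removes a per-position offset and the cosine ignores scale, an estimate that differs from the truth only by such an offset and a positive scaling scores a similarity of 1 under this assumed definition.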



6.3 General Performance 

We randomly sample a number of ranking lists and generate preference pairs according to a ground-truth $w$. The number of preference pairs in the training set ranges from 20 to 200. We plot the precision and similarity of the estimated $\hat{w}$ with respect to the number of training pairs in Figure 1 and Figure 2. It can be observed from Figures 1 and 2 that the performance generally improves as the number of training pairs increases, indicating that preferences over ranking lists can be utilized to improve the estimate of the utility function $w$. Another observation is that when about 200 preference pairs are included in the training set, the precision on the test sets becomes close to 95% under both settings. This observation suggests that we can estimate $w$ precisely from preferences over ranking lists. We also notice that the similarity and precision sometimes give different conclusions about the relative performance on Data 1 and Data 2. We think this is because the similarity measure is sensitive to the choice of the offset constants. For example, large offset constants will give a similarity very close to 1. Currently, we use $w_{1,L}, \ldots, w_{K,L}$ as the offset constants, as in Equation (3). Generally, we regard precision as the more meaningful evaluation metric and report similarity as a complement to precision.

After the utility function $w$ is obtained, we can reconstruct the gain vector and the discount factors from $w$. To this end, we rewrite $w$ as a matrix $W$ of size $L \times K$. Assume that the singular value decomposition of the matrix $W$ can be expressed as $W = U \, \mathrm{diag}(\sigma_1, \ldots, \sigma_n) V^T$, where $\sigma_1 \ge \cdots \ge \sigma_n$. Then the rank-1 approximation of $W$ is $\hat{W} = \sigma_1 u_1 v_1^T$. In this case, the first left singular vector $u_1$ is the estimate of the gain vector and the first right singular vector $v_1$ is the estimate of the discount factors. We plot the estimated gain vector and discount factors against their true values in Figure 4 and Figure 3, respectively. Note that perfect estimates give straight lines in these two figures. We can see that the discount factors for the top-ranked positions are closer to a straight line and thus are estimated more accurately. This is because the discount factors of top-ranked positions have a greater impact on the preference between ranking lists. Therefore, these discount factors are captured more precisely by the constraints. A similar phenomenon can also be observed for the gain vector.
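The rank-1 reconstruction can be illustrated on a small matrix. To keep the sketch free of dependencies, it uses power iteration in place of a library SVD routine to obtain the leading singular triple; this is a substitute technique, not necessarily what was used in the experiments. The gain and discount values are hypothetical.

```python
import math


def leading_singular_triple(W, iters=200):
    """Return (sigma_1, u_1, v_1) of W (a list of rows) by power iteration."""
    rows, cols = len(W), len(W[0])
    v = [1.0] * cols
    for _ in range(iters):
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # u = W v
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # v = W^T u
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    return sigma, [x / sigma for x in u], v


gains = [3.0, 2.0, 1.0]          # hypothetical gain vector (L = 3)
discounts = [1.0, 0.5]           # hypothetical discount factors (K = 2)
W = [[g * c for c in discounts] for g in gains]   # L x K, exactly rank 1

sigma, u1, v1 = leading_singular_triple(W)
```

Singular vectors are only determined up to scale and sign, so `u1` and `v1` recover the gains and discounts up to a constant factor; this is consistent with comparing estimates to true values by checking whether they fall on a straight line.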

Figure 1: Precision over the test set with respect to the number of training pairs

Figure 2: Similarity between $w$ and $\hat{w}$ with respect to the number of training pairs

Figure 3: Estimated discount factors with respect to true discount factors

Figure 4: The estimated gain vector with respect to the true gain vector

6.4 Noisy Settings

In real-world scenarios, the preference pairs of ranking lists can be noisy. Therefore, it is interesting to investigate the effect of noisy pairs on performance. To this end, we fix the number of training pairs at 200 and create noisy pairs by randomly flipping a number of pairs in the training set. In our experiments, the number of noisy pairs ranges from 5 to 40. Since the trade-off value $C$ is important to the performance in the noisy setting, we select the value of $C$ that shows the best performance on an independent validation set. We report the performance with respect to the number of noisy pairs in Figure 5 and Figure 6. We can observe that the performance decreases as the number of noisy pairs grows.

In addition to noisy preference pairs, we also consider noise in the grades of documents. In this case, we randomly modify the grades of a number of documents to introduce noise into the training set. The estimated $\hat{w}$ is used to predict preferences on a test set. The precision with respect to the number of noisy documents is shown in Figure 7. It can be observed that the performance decreases as the number of noisy documents grows.

6.5 Optimal Rankings 

We can further restrict the preference pairs by involving the optimal ranking in each pair of training data. For example, we set one ranking list of the preference pair to be the optimal ranking [5, 5, 4, 4, 3, 3, 2, 2, 1, 1]. In this case, if the other ranking list is generated by permuting the same list, Proposition 1 implies that all compatible gain vectors agree that the optimal ranking is preferred to the other ranking lists. In other words, these preference pairs do not impose any constraints on the utility function $w$. Therefore, the constraints corresponding to these preference pairs are not effective in determining the utility function $w$. Consequently, the performance does not increase as the number of training pairs grows, as shown in Figure 9.

If the ranking lists contain different sets of grades, a fraction of the constraints can be effective. The performance grows slowly as the number of training pairs increases, as reported in Figure 10. By comparing Figure 12 and Figure 10, we can observe that when the type of preference is restricted, the learning algorithm requires more pairs to obtain comparable performance. We conclude from this observation that some pairs are more effective than others in determining $w$. Thus, if we can design an algorithm to select these pairs, the number of pairs required for training can be greatly reduced. How to design algorithms to select effective preference pairs for learning DCG will be addressed as a future research topic.

Figure 5: Precision over the test set with respect to the number of noisy pairs in noisy settings

Figure 6: Similarity between $w$ and $\hat{w}$ with respect to the number of noisy pairs in noisy settings

Figure 7: Precision with respect to the number of noisy grades in noisy settings

Figure 8: Similarity with respect to the number of noisy grades in noisy settings

7. AN ENHANCED MODEL 

The objective function defined in Equation (1) does not consider the degree of difference between ranking lists. For example, it treats the preference pairs $[5, 5, 4, 2, 1] \succ [5, 4, 5, 2, 1]$ and $[5, 5, 4, 2, 1] \succ [2, 1, 5, 5, 4]$ in the same way, although they differ greatly in DCG. To overcome this problem, we propose an enhanced model that takes the degree of difference between ranking lists into consideration:



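$$\min_{w, \xi} \ w^T w + C \sum_{(i,j) \in S} \xi_{ij} \quad (4)$$

subject to:

$$w^T (s(\pi_i) - s(\pi_j)) \ge \mathrm{Dist}(\pi_i, \pi_j) - \xi_{ij}, \quad (i, j) \in S,$$

$$\xi_{ij} \ge 0, \quad (i, j) \in S,$$

$$w_{k,l} \ge w_{k,l+1}, \quad k = 1, \ldots, K \text{ and } l = 1, \ldots, L-1,$$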

where $\mathrm{Dist}(\cdot, \cdot)$ is a distance measure for a pair of permutations $\pi_i$ and $\pi_j$. In principle, we would prefer $\mathrm{Dist}(\pi_i, \pi_j)$ to be a good approximation of $\mathrm{DCG}(\pi_i) - \mathrm{DCG}(\pi_j)$. However, since we do not actually know the ground-truth $w$ in practice, it is generally difficult to obtain a precise approximation. In our simulation, we apply the Hamming distance as the distance measure:

$$\mathrm{Ham}(\pi_1, \pi_2) = \sum_{k} \mathbf{1}[\pi_1(k) \ne \pi_2(k)]. \quad (5)$$

We perform simulations to evaluate the enhanced model. From Figure 12 and Figure 11, we can observe that the performance improvement of the enhanced model is not very significant. We suspect that this is because the Hamming distance is not a precise approximation of the DCG difference. We plan to investigate this problem in our future study.

Figure 9: Performance when training pairs are generated by permuting the same list

Figure 10: Performance when training pairs are generated from different lists
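The Hamming distance used as the margin in the enhanced model is simple to compute. The sketch below applies it to the two preference pairs mentioned at the start of this section; treating the grade lists themselves as the sequences being compared is an assumption made for illustration, and the function name is hypothetical.

```python
def hamming(p1, p2):
    """Number of positions at which two equal-length rankings differ."""
    if len(p1) != len(p2):
        raise ValueError("rankings must have the same length")
    return sum(a != b for a, b in zip(p1, p2))


best = [5, 5, 4, 2, 1]
near = hamming(best, [5, 4, 5, 2, 1])   # only two positions are swapped
far = hamming(best, [2, 1, 5, 5, 4])    # every position differs
```

Under this choice the second pair demands a margin of 5 and the first a margin of 2. The Hamming distance ignores where in the list the differences occur, whereas DCG weights top positions more heavily, which may be one reason it approximates the DCG difference only roughly.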

8. LEARNING WITHOUT DOCUMENT GRADES

When the grades of the documents are not available, we can also fit a model to predict the preference between two ranking lists. To this end, we use a $K \times K$-dimensional binary vector $s(\pi)$ to represent the ranking $\pi$. The first $K$ components of $s(\pi)$ correspond to the first position of the $K$-position ranking, the second $K$ components to the second position, and so on.

For example, for a ranking list $\pi$

$d_3, d_1, d_4, d_2, d_5$

the corresponding binary vector is

[0,0,1,0,0, 1,0,0,0,0, 0,0,0,1,0, 0,1,0,0,0, 0,0,0,0,1] 
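This grade-free encoding can be sketched as follows, with documents identified by their 1-based index. The helper name `encode_documents` is hypothetical.

```python
def encode_documents(ranking):
    """One block of K entries per position; entry d is 1 iff document d is there.

    ranking lists 1-based document indices in ranked order, e.g. [3, 1, 4, 2, 5].
    """
    K = len(ranking)
    s = [0] * (K * K)
    for pos, doc in enumerate(ranking):
        s[pos * K + (doc - 1)] = 1
    return s


s = encode_documents([3, 1, 4, 2, 5])
```

The result is a flattened permutation matrix, so each weight in the learned vector refers to one (position, document) combination rather than one (position, grade) combination.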

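Given a set of preferences over ranking lists, we can obtain $w$ by solving the following optimization problem:

$$\min_{w, \xi} \ w^T w + C \sum_{(i,j) \in S} \xi_{ij} \quad (6)$$

subject to:

$$w^T (s(\pi_i) - s(\pi_j)) \ge 1 - \xi_{ij}, \quad (i, j) \in S, \quad (7)$$

$$\xi_{ij} \ge 0, \quad (i, j) \in S. \quad (8)$$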

In this case, the constraints $w_{k,l} \ge w_{k,l+1}$ are not included in the optimization problem, since we do not have any prior knowledge about the grades of the documents. The precision on the test set with respect to the number of training pairs is reported in Figure 13. We can observe that $w$ can be learned precisely even without the grades of documents. In this case, the learned utility function $w$ can be interpreted as the relevance judgements for the documents.

9. CONCLUSIONS AND FUTURE WORK 

In this paper, we investigate the coherence of DCG, which is an important performance measure in information retrieval. Our analysis shows that DCG is incoherent in general, i.e., different gain vectors can lead to different judgements about the performance of a ranking function. Therefore, it is vital to select reasonable parameters for DCG in order to obtain meaningful comparisons of ranking functions. We propose to learn the DCG gain values and discount factors from preference judgements over ranking lists. In particular, we develop a model to learn DCG as a linear utility function and formulate the method as a quadratic programming problem. Preliminary simulation results suggest the effectiveness of the proposed method.

We plan to further investigate the problem of learning DCG and apply the proposed method to real-world data sets. Furthermore, we plan to generalize DCG to nonlinear utility functions to model more sophisticated requirements of ranking lists, such as diversity and personalization.